33 research outputs found

    A multimodal dialog approach to mental state characterization in clinically depressed, anxious, and suicidal populations

    Background: The rise in depression, anxiety, and suicide rates has led to increased demand for telemedicine-based mental health screening and remote patient monitoring (RPM) solutions to alleviate the burden on, and enhance the efficiency of, mental health practitioners. Multimodal dialog systems (MDS) that conduct on-demand, structured interviews offer a scalable and cost-effective solution to address this need.
    Objective: This study evaluates the feasibility of a cloud-based MDS agent, Tina, for mental state characterization in participants with depression, anxiety, and suicide risk.
    Method: Sixty-eight participants were recruited through an online health registry and completed 73 sessions, with 15 (20.6%), 21 (28.8%), and 26 (35.6%) sessions screening positive for depression, anxiety, and suicide risk, respectively, using conventional screening instruments. Participants then interacted with Tina as they completed a structured interview designed to elicit calibrated, open-ended responses about their feelings and emotional state. Simultaneously, the platform streamed their speech and video recordings in real time to a HIPAA-compliant cloud server to compute speech, language, and facial-movement-based biomarkers. After their sessions, participants completed user experience surveys. Machine learning models were developed using the extracted features and evaluated with the area under the receiver operating characteristic curve (AUC).
    Results: For both depression and suicide risk, affected individuals tended to have a higher percent pause time, while those positive for anxiety showed reduced lip movement relative to healthy controls. Among single-modality classification models, speech features performed best for depression (AUC = 0.64; 95% CI = 0.51–0.78), facial features for anxiety (AUC = 0.57; 95% CI = 0.43–0.71), and text features for suicide risk (AUC = 0.65; 95% CI = 0.52–0.78). The best overall performance was achieved by decision fusion of all models in identifying suicide risk (AUC = 0.76; 95% CI = 0.65–0.87). Participants reported that the experience was comfortable and that they were comfortable sharing their feelings.
    Conclusion: MDS is a feasible, useful, effective, and interpretable solution for RPM in real-world clinically depressed, anxious, and suicidal populations. Facial information is more informative for anxiety classification, while speech and language are more discriminative of depression and suicidality. In general, combining speech, language, and facial information improved model performance on all classification tasks.
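    The decision fusion mentioned in the results can be illustrated with a short sketch. This is a minimal, hypothetical example of decision-level (late) fusion, assuming one probabilistic classifier per modality; the placeholder feature arrays, logistic-regression models, and simple probability averaging are assumptions for illustration, not the study's actual pipeline.

```python
# Hypothetical sketch of decision-level fusion across speech, facial, and text
# classifiers; averaging predicted probabilities is one simple fusion rule and
# is not necessarily the rule used in the study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder feature matrices for each modality (n_sessions x n_features).
n = 73
X_speech = rng.normal(size=(n, 20))
X_face = rng.normal(size=(n, 15))
X_text = rng.normal(size=(n, 30))
y = rng.integers(0, 2, size=n)  # 1 = screened positive, 0 = control

idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.3, random_state=0, stratify=y)

probs = []
for X in (X_speech, X_face, X_text):
    clf = LogisticRegression(max_iter=1000).fit(X[idx_train], y[idx_train])
    p = clf.predict_proba(X[idx_test])[:, 1]      # per-modality decision score
    print("single-modality AUC:", roc_auc_score(y[idx_test], p))
    probs.append(p)

# Decision fusion: combine the per-modality scores (here, a simple mean).
fused = np.mean(probs, axis=0)
print("fused AUC:", roc_auc_score(y[idx_test], fused))
```

    In such a late-fusion design, each modality keeps its own classifier, so an uninformative modality can be inspected or dropped without retraining the others; the fused score is simply a combination of the per-modality decision scores.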

    Toward understanding speech planning by observing its execution – Representations, modeling and analysis

    This thesis proposes a balanced framework toward understanding speech motor planning and control by observing aspects of its behavioral execution. To this end, it proposes representing, modeling, and analyzing real-time speech articulation data from both 'top-down' (or knowledge-driven) and 'bottom-up' (or data-driven) perspectives. The first part of the thesis uses existing knowledge from linguistics and motor control to extract meaningful representations from real-time magnetic resonance imaging (rtMRI) data, and further, to posit and test specific hypotheses regarding kinematic and postural planning during pausing behavior. In the former case, we propose a measure to quantify the speed of articulators during pauses as well as during their immediate neighborhoods. Using appropriate statistical analysis techniques, we find support for the hypothesis that pauses at major syntactic boundaries (i.e., grammatical pauses), but not ungrammatical (e.g., word search) pauses, are planned by a high-level cognitive mechanism that also controls the rate of articulation around these junctures. In the latter case, we present a novel automatic procedure to characterize vocal posture from rtMRI data. Statistical analyses suggest that articulatory settings differ during rest positions, ready positions and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism. We show that this may be because postures assumed during pauses are significantly more mechanically advantageous than postures assumed during absolute rest. In other words, inter-speech postures allow for a larger change in the space of motor control tasks/goals for a minimal change in the articulatory posture space as compared to postures at absolute rest. We argue that such top-down approaches can be used to augment models of speech motor control.
    The second part of the thesis presents a computational, data-driven approach to derive interpretable movement primitives from speech articulation data in a bottom-up manner. It puts forth a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given data matrix into a set of spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a cost function that trades off the mismatch between the proposed model and the input data against the number of primitives that are active at any given instant. The method is applied to both measured articulatory data obtained through electromagnetic articulography (EMA) and synthetic data generated using an articulatory synthesizer. The thesis then describes how to evaluate the algorithm's performance quantitatively and further performs a qualitative assessment of its ability to recover compositional structure from data. The results suggest that the proposed algorithm extracts movement primitives from human speech production data that are linguistically interpretable. We further examine how well derived representations of "primitive movements" of speech articulation can be used to classify broad phone categories, and thus provide more insight into the link between speech production and perception. We finally show that such primitives can be mathematically modeled using nonlinear dynamical systems in a control-theoretic framework for speech motor control. Such a primitives-based framework could thus help inform practicable theories of speech motor control and coordination.
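    As a concrete illustration of the decomposition described in the second part of the thesis, the sketch below writes a data matrix as a sum of time-shifted spatiotemporal bases weighted by an activation matrix, with a cost that trades reconstruction error against activation sparsity. The simple L1 penalty stands in for the sparseness constraint of the cNMFsc algorithm, and all dimensions are toy values chosen only for illustration.

```python
# Illustrative sketch of the convolutive NMF model: V ~= sum_t W[t] @ shift(H, t),
# with a simple L1 activation penalty standing in for the sparseness constraint.
import numpy as np

def shift_right(H, t):
    """Shift activation matrix H by t columns to the right, zero-padding on the left."""
    if t == 0:
        return H
    shifted = np.zeros_like(H)
    shifted[:, t:] = H[:, :-t]
    return shifted

def reconstruct(W, H):
    """W: (T, M, K) spatiotemporal bases; H: (K, N) activations; returns an (M, N) estimate."""
    return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))

def cost(V, W, H, lam=0.1):
    """Reconstruction error traded off against the strength of active primitives."""
    V_hat = reconstruct(W, H)
    return np.sum((V - V_hat) ** 2) + lam * np.sum(np.abs(H))

# Toy example: 8 articulator channels, 200 frames, 5 primitives spanning 10 frames each.
rng = np.random.default_rng(0)
M, N, K, T = 8, 200, 5, 10
W = rng.random((T, M, K))
H = rng.random((K, N))
V = reconstruct(W, H)          # synthetic "articulatory" data generated from the model
print(cost(V, W, H))           # zero reconstruction error here, so only the sparsity term remains
```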

    Vocal tract cross-distance estimation from real-time MRI using region-of-interest analysis

    Real-time magnetic resonance imaging affords speech articulation data with good spatial and temporal resolution and complete midsagittal views of the moving vocal tract, but also brings many challenges in the domain of image processing and analysis. Region-of-interest analysis has previously been proposed for simple, efficient and robust extraction of linguistically-meaningful constriction degree information. However, the accuracy of such methods has not been rigorously evaluated, and no method has been proposed to calibrate the pixel intensity values or convert them into absolute measurements of length. This work provides such an evaluation, as well as insights into the placement of regions in the image plane and calibration of the resultant pixel intensity measurements. Measurement errors are shown to be generally at or below the spatial resolution of the imaging protocol with a high degree of consistency across time and overall vocal tract configuration, validating the utility of this method of image analysis.
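    A schematic of the region-of-interest idea, under assumed details: mean pixel intensity inside a fixed region of the midsagittal image is taken as a proxy for tissue presence, and a linear calibration maps that intensity to an estimated cross-distance. The region coordinates, image size, and calibration constants below are placeholders, not the paper's measured values.

```python
# Hypothetical region-of-interest analysis on a midsagittal rtMRI frame:
# mean pixel intensity in the ROI is mapped to a cross-distance estimate by a
# linear calibration (the slope/intercept here are made up for illustration).
import numpy as np

def roi_mean_intensity(frame, rows, cols):
    """Mean intensity inside a rectangular region of interest."""
    r0, r1 = rows
    c0, c1 = cols
    return frame[r0:r1, c0:c1].mean()

def intensity_to_cross_distance(intensity, slope=-6.0, intercept=7.5):
    """Placeholder linear calibration: brighter (more tissue) => smaller airway opening (mm)."""
    return max(0.0, slope * intensity + intercept)

# Toy 68x68 frame with intensities in [0, 1]; ROI placed over a constriction location.
rng = np.random.default_rng(0)
frame = rng.random((68, 68))
intensity = roi_mean_intensity(frame, rows=(30, 38), cols=(40, 48))
print(f"ROI intensity {intensity:.2f} -> estimated cross-distance "
      f"{intensity_to_cross_distance(intensity):.1f} mm")
```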

    Are articulatory settings mechanically advantageous for speech motor control?

    We address the hypothesis that postures adopted during grammatical pauses in speech production are more "mechanically advantageous" than absolute rest positions for facilitating efficient postural motor control of vocal tract articulators. We quantify vocal tract posture corresponding to inter-speech pauses and absolute rest intervals, as well as vowel and consonant intervals, using automated analysis of video captured with real-time magnetic resonance imaging during production of read and spontaneous speech by 5 healthy speakers of American English. We then use locally-weighted linear regression to estimate the articulatory forward map from low-level articulator variables to high-level task/goal variables for these postures. We quantify the overall magnitude of the first derivative of the forward map as a measure of mechanical advantage. We find that postures assumed during grammatical pauses in speech, as well as speech-ready postures, are significantly more mechanically advantageous than postures assumed during absolute rest. Further, these postures represent empirical extremes of mechanical advantage, between which lie the postures assumed during various vowels and consonants. Relative mechanical advantage of different postures might be an important physical constraint influencing planning and control of speech production.
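    The analysis logic can be sketched as follows: fit a locally-weighted linear regression around a query posture to approximate the forward map from articulator variables to task variables, and summarize the magnitude of the resulting local linear map (its Frobenius norm) as a proxy for mechanical advantage. The synthetic forward map, sampled postures, and kernel bandwidth below are assumptions for illustration only.

```python
# Illustrative locally-weighted linear regression: estimate the local slope (Jacobian) of a
# synthetic forward map x -> g(x) at a query posture, then summarize its magnitude.
import numpy as np

def forward_map(x):
    """Toy nonlinear forward map from 4 articulator variables to 2 task variables."""
    return np.array([np.sin(x[0]) + 0.5 * x[1], np.cos(x[2]) * x[3]])

def local_jacobian(X, Y, x_query, bandwidth=0.3):
    """Weighted least-squares fit of Y on (X - x_query); the slope matrix approximates the local Jacobian."""
    d = X - x_query
    w = np.exp(-np.sum(d ** 2, axis=1) / (2 * bandwidth ** 2))   # Gaussian kernel weights
    A = np.hstack([np.ones((len(X), 1)), d])                     # intercept column + local coordinates
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, Y * sw, rcond=None)
    return beta[1:].T                                            # (n_task_vars, n_articulator_vars)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                                    # sampled postures
Y = np.array([forward_map(x) for x in X])                        # corresponding task variables

posture = np.zeros(4)
J = local_jacobian(X, Y, posture)
print("mechanical-advantage proxy (||J||_F):", np.linalg.norm(J))
```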

    What Do Patients Say About Their Disease Symptoms? Deep Multilabel Text Classification With Human-in-the-Loop Curation for Automatic Labeling of Patient Self Reports of Problems

    The US Food and Drug Administration has accorded increasing importance to patient-reported problems in clinical and research settings. In this paper, we explore one of the largest online datasets, comprising 170,141 open-ended self-reported responses (called "verbatims") from patients with Parkinson's (PwPs) to questions about what bothers them about their Parkinson's Disease and how it affects their daily functioning, also known as the Parkinson's Disease Patient Report of Problems. Classifying such verbatims into multiple clinically relevant symptom categories is an important problem and requires multiple steps: expert curation, a multi-label text classification (MLTC) approach, and large amounts of labeled training data. Further, human annotation of such large datasets is tedious and expensive. We present a novel solution to this problem in which we build a baseline dataset using 2,341 (of the 170,141) verbatims annotated by nine curators, including clinical experts and PwPs. We develop a rules-based linguistic dictionary using NLP techniques and a graph-database-based expert phrase-query system to scale the annotation to the remaining cohort, generating the machine-annotated dataset, and finally build a Keras/TensorFlow-based MLTC model for both datasets. The machine-annotated model significantly outperforms the baseline model, with an F1-score of 95% across 65 symptom categories on a held-out test set.
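    A minimal sketch of a Keras/TensorFlow multi-label text classifier of the kind described above: one sigmoid output per symptom category trained with binary cross-entropy, so each verbatim can receive several labels at once. The vocabulary size, sequence length, and architecture below are placeholders rather than the paper's reported configuration.

```python
# Hypothetical multi-label text classification (MLTC) model: each of the 65 symptom
# categories gets an independent sigmoid output, trained with binary cross-entropy.
import tensorflow as tf

NUM_CATEGORIES = 65          # symptom categories (from the abstract)
VOCAB_SIZE = 20_000          # placeholder vocabulary size
SEQ_LEN = 128                # placeholder maximum verbatim length (tokens)

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=SEQ_LEN)

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_CATEGORIES, activation="sigmoid"),  # multi-label head
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(multi_label=True, num_labels=NUM_CATEGORIES)])

# Usage sketch: `texts` is a list of verbatims, `labels` an (n, 65) binary matrix.
# vectorizer.adapt(texts)
# model.fit(tf.constant(texts), labels, epochs=5, validation_split=0.1)
```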

    Simulated speaking environments for language learning: insights from three cases

    Abstract: Recent CALL technology reviews cover a plethora of technologies available to language learners to improve a variety of skills, including speaking. However, few technology-enhanced self-access tools are available for pragmatic development, especially in the oral modality. Recognizing the benefits of structured practice for second language development, we demonstrate how such practice can be incorporated into three recently developed simulated speaking environments that vary in the targeted L2 (French, English), domain of use (academic or everyday interaction), emphasis on higher-order and/or lower-order skills, and accommodation of multiple L2 varieties. In the spirit of finding synergies and learning from each other's experiences in specific local contexts, we address the following research questions: (1) How do the local context, researcher and learner goals, and technological possibilities influence the design of each computer application? (2) Based on the examination of the three programs, what can we learn in view of redesign options and suggest to future developers of such programs?